18.6 Noise-Contrastive Estimation

Noise-contrastive estimation (NCE) represents the probability distribution estimated by the model explicitly as

\[\log p_{model}(x) = \log \hat{p}(x; \theta) + c\]

where c is explicitly introduced as an approximation of \(-\log Z(\theta)\). Rather than estimating only \(\theta\), the NCE procedure treats c as just another parameter and estimates \(\theta\) and c simultaneously, using the same algorithm for both. The resulting \(\log p_{model}(x)\) may not correspond exactly to a valid probability distribution, but it becomes closer and closer to being valid as the estimate of c improves.
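As a minimal sketch of this parameterization (all names here are hypothetical, assuming a PyTorch setting), c can be implemented as one extra scalar parameter trained alongside \(\theta\):

```python
import torch
import torch.nn as nn

class UnnormalizedModel(nn.Module):
    """Sketch: log p_model(x) = log p_hat(x; theta) + c, with c a learned scalar."""
    def __init__(self, dim):
        super().__init__()
        # theta: any network producing an unnormalized log-density log p_hat(x; theta)
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 1))
        # c: a single scalar standing in for -log Z(theta), treated like any other parameter
        self.c = nn.Parameter(torch.zeros(()))

    def log_prob(self, x):
        return self.net(x).squeeze(-1) + self.c
```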

NCE works by reducing the unsupervised learning problem of estimating p(x) to that of learning a probabilistic binary classifier in which one of the categories corresponds to the data generated by the model. Specifically, we introduce a noise distribution \(p_{noise}(x)\) that is tractable to evaluate and to sample from. We can now construct a model over both x and a new, binary class variable y. In the new joint model:

\[\begin{split}p_{joint}(y=1) = \frac{1}{2} \\ p_{joint}(x|y=1) = p_{model}(x) \\ p_{joint}(x|y=0) = p_{noise}(x)\end{split}\]

y is a switch variable that determines whether we will generate x from the model or from the noise distribution.

We can construct a similar joint model of training data. In this case, the switch variable determines whether we draw x from the data or from the noise.

\[\begin{split}p_{train}(y=1) = \frac{1}{2} \\ p_{train}(x|y=1) = p_{data}(x) \\ p_{train}(x|y=0) = p_{noise}(x)\end{split}\]
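A sketch of drawing supervised training pairs from \(p_{train}\) under these definitions (hypothetical helper; `noise_dist` is assumed to be a `torch.distributions` object): half of each batch is real data with y = 1, half is noise with y = 0.

```python
import torch

def make_train_batch(x_data, noise_dist):
    # y = 1: examples drawn from the data; y = 0: samples drawn from the noise distribution
    x_noise = noise_dist.sample((x_data.shape[0],))
    x = torch.cat([x_data, x_noise], dim=0)
    y = torch.cat([torch.ones(len(x_data)), torch.zeros(len(x_noise))])
    return x, y
```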

We can now just use standard maximum likelihood learning on the supervised learning problem of fitting \(p_{joint}\) to \(p_{train}\):

\[\theta, c = \mathop{\arg\max}_{\theta, c} E_{x, y \sim p_{train}} \log p_{joint}(y \mid x)\]

Here \(p_{joint}\) is essentially a logistic regression model applied to the difference in log probabilities of the model and the noise distribution:

\[\begin{split}p_{joint}(y=1 \mid x) &= \frac{p_{model}(x)}{p_{model}(x) + p_{noise}(x)} \\ &= \frac{1}{1 + \exp\left(\log \frac{p_{noise}(x)}{p_{model}(x)}\right)} \\ &= \sigma\left(\log p_{model}(x) - \log p_{noise}(x)\right)\end{split}\]
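In code, the classifier's logit is exactly this log-difference, so fitting \(p_{joint}\) to \(p_{train}\) reduces to a binary cross-entropy loss (a sketch continuing the hypothetical helpers above; `noise_dist.log_prob` is assumed to return a per-example joint log-density):

```python
import torch.nn.functional as F

def nce_loss(model, noise_dist, x, y):
    # logit = log p_model(x) - log p_noise(x), so sigmoid(logit) = p_joint(y=1 | x)
    logits = model.log_prob(x) - noise_dist.log_prob(x)
    # Maximizing E[log p_joint(y | x)] is the same as minimizing this binary cross-entropy
    return F.binary_cross_entropy_with_logits(logits, y)
```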

NCE is simple to apply as long as (see the sketch after this list):

  • \(\hat{p}_{model}\) is easy to back-propagate through
  • \(p_{noise}(x)\) is easy to evaluate, in order to evaluate \(p_{joint}(y \mid x)\)
  • \(p_{noise}(x)\) is easy to sample from, in order to generate training data
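Putting the pieces together, a rough end-to-end sketch under these three conditions, with a standard normal noise distribution (easy to evaluate and sample) and synthetic stand-in data; the optimizer updates \(\theta\) and c jointly:

```python
import torch
from torch.distributions import Independent, Normal

dim = 2
model = UnnormalizedModel(dim)                        # from the earlier sketch
noise_dist = Independent(Normal(torch.zeros(dim), torch.ones(dim)), 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # updates theta and c together

x_data = torch.randn(512, dim) * 0.5 + 1.0            # stand-in for real training data

for step in range(1000):
    x, y = make_train_batch(x_data, noise_dist)
    loss = nce_loss(model, noise_dist, x, y)
    opt.zero_grad()
    loss.backward()                                   # back-propagates through log p_hat and c
    opt.step()
```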

When to use noise-contrastive estimation:

  • NCE is most successful when applied to problems with few random variables, but it can work well even if those random variables can take on many values, e.g., modeling the conditional distribution over a word given the context of the word.
  • It is less efficient when applied to problems with many random variables: the logistic regression classifier can reject a noise sample by identifying any one variable whose value is unlikely, which means learning slows down greatly after \(p_{model}\) has learned the basic marginal statistics.

The constraint that \(p_{noise}\) must be easy to evaluate and easy to sample from can be overly restrictive. When \(p_{noise}\) is too simple, most samples are likely to be too obviously distinct from the data to force \(p_{model}\) to improve noticeably.

Like score matching and pseudolikelihood, NCE does not work if only a lower bound on \(\hat{p}\) is available.

When the model distribution is copied to define a new noise distribution before each gradient step, NCE defines a procedure called self-contrastive estimation, whose expected gradient is equivalent to the expected gradient of maximum likelihood.

  • The special case of NCE where the noise samples are those generated by the model suggests that maximum likelihood can be interpreted as a procedure that forces a model to constantly learn to distinguish reality from its own evolving beliefs, while noise-contrastive estimation achieves some reduced computational cost by only forcing the model to distinguish reality from a fixed baseline (the noise model).
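A conceptual sketch of self-contrastive estimation, assuming a model that is tractable both to evaluate and to sample from (here a hypothetical learnable diagonal Gaussian); for a general unnormalized model, drawing the noise samples from the model itself is the expensive step. Before each gradient step, the current model is frozen and reused as the noise distribution:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Independent, Normal

# Hypothetical tractable model: a diagonal Gaussian we can both evaluate and sample from.
loc = torch.nn.Parameter(torch.zeros(2))
log_scale = torch.nn.Parameter(torch.zeros(2))
opt = torch.optim.Adam([loc, log_scale], lr=1e-2)

x_data = torch.randn(512, 2) * 0.5 + 1.0              # stand-in for real training data

for step in range(1000):
    # Freeze a copy of the current model to serve as this step's noise distribution.
    noise_dist = Independent(Normal(loc.detach().clone(), log_scale.detach().exp()), 1)
    model_dist = Independent(Normal(loc, log_scale.exp()), 1)
    x, y = make_train_batch(x_data, noise_dist)        # helper from the earlier sketch
    logits = model_dist.log_prob(x) - noise_dist.log_prob(x)
    loss = F.binary_cross_entropy_with_logits(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```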